Abstract

In this paper, we focus on the question of 
the extent to which online learning can bene- 
fit from distributed computing. We focus on 
the setting in which N agents online-learn co- 
operatively, where each agent only has access 
to its own data. We propose a generic data- 
distributed online learning meta-algorithm. 
We then introduce the Distributed Weighted 
Majority and Distributed Online Mirror De- 
scent algorithms, as special cases. We show, 
using both theoretical analysis and experi- 
ments, that compared to a single agent: given 
the same computation time, these distributed 
algorithms achieve smaller generalization er- 
rors; and given the same generalization er- 
rors, they can be N times faster. 



1. Introduction 

The real world can be viewed as a gigantic distributed 
system that evolves over time. An intelligent agent in 
this system can learn from two sources: examples from 
the environment, as well as information from other 
agents. One way to state the question addressed by the 
Data-Distributed Online Learning (DDOL) schemes 
we introduce can be informally described as follows: 
within an interconnected network of learning agents, 
although an agent only receives m samples of input 
data, can it be made to perform as if it has received 
M > m samples? Here the performance is measured 
by generalization abilities (prediction error or regret 
for the online setting). In other words, to what extent 
can an agent make fewer generalization errors by
utilizing information from other online-learning agents?

This question can also be phrased another way. In re- 
cent years, the increasing ubiquity of massive datasets 
as well as the opportunities for distributed comput- 
ing (cloud computing, multi-core, etc.), have conspired 
to spark much interest in developing distributed algo- 
rithms for machine learning (ML) methods. While it is 
easy to see how parallelism can be obtained for most 
of the computational problems in ML, the question 
arises whether online learning, which appears at first 
glance to be inherently serial, can be fruitfully paral- 
lelized to any significant degree. While several recent 
papers have proposed distributed schemes, the ques- 
tion of whether significant speedups over the default 
serial scheme can be achieved has remained fairly open. 
Theory establishing or disallowing such a possibility is 
particularly to be desired. To the best of our knowl- 
edge, this paper is the first work that answers these 
questions for the general online learning setting. 

In this paper we show both theoretically and experi- 
mentally that significant speedups are possible in on- 
line learning by utilizing parallelism. We introduce a 
general framework for data-distributed online learning 
which encapsulates schemes such as weighted majority, 
online subgradient descent, and online exponentiated 
gradient descent. 

1.1. Related Work 

In an empirical study (Delalleau & Bengio, 2007), the
authors proposed to make a trade-off between batch
and stochastic gradient descent by using averaged
mini-batches of size 10 to 100. A parameter averaging
scheme was proposed in (Mann et al., 2009) to
solve a batch regularized conditional max entropy
optimization problem, where the distributed parameters
from each agent are averaged in the final stage. A
distributed subgradient descent method was proposed in
(Nedic & Ozdaglar, 2009) and an incremental subgradient
method using a Markov chain was proposed in
(Johansson et al., 2009). In (Duchi et al., 2010), a
distributed dual averaging algorithm was proposed for
minimizing convex empirical risks via decentralized 
networks. Convergence rates were reported for vari- 
ous network topologies. The same idea of averaged 
subgradients was extended to centralized online learn- 
ing settings in (Dekel et al., 2010). The problem of 
multi-agent learning has been an active research topic 
in reinforcement learning. In this paper, we will focus 
on the online supervised learning setting. 

2. Setup and DDOL Meta-Algorithm 

In this paper, we assume that each agent only has access to a portion of the data locally and to communications with other agents. Suppose we have $N$ learning agents. At round $t$, the $i$-th learning agent $A_i$, $i = 1, \ldots, N$, receives an example $x_i^t$ from the environment and makes a prediction $y_i^t$. The environment then reveals the correct answer $l_i^t$ corresponding to $x_i^t$ and the agent suffers some loss $L(x_i^t, y_i^t, l_i^t)$. The parameter set of an agent $A_i$ at time $t$ is $w_i^t \in W$. Each agent is a vertex of a connected graph $G = (A, E)$. There will be bidirectional communication between $A_i$ and $A_j$ if they are neighbors (connected by an edge $e_{ij}$). $A_i$ has $N_i - 1$ neighbors.

The generic meta-algorithm for data-distributed online learning (DDOL) is very simple: each agent $A_i$ works according to the following procedure:

Algorithm 1 Distributed Online Learning (DDOL)
for $t = 1, 2, \ldots$ do
    $A_i$ makes local prediction(s) on example(s) $x_i^t$;
    $A_i$ Update $w_i^t$ using local correct answer(s) $l_i^t$;
    $A_i$ Communicate $w_i^t$ with its neighbors and do Weighted Average over $w_j^t$, $j = 1, \ldots, N_i$;
end for



To derive a distributed online learning algorithm, one
needs to specify the Update, Communicate and Weighted
Average schemes. In the following sections, we will
discuss how to use two classic online learning methods 
as the basic Update scheme, and how the combination 
with different Communicate/Weighted Average schemes 
leads to different performance guarantees. 
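The three pluggable steps of the meta-algorithm can be sketched as one synchronous round over all agents. This is a minimal illustration, not the authors' implementation: the names `ddol_round`, `sgd_update` and `arithmetic_mean` are hypothetical, and the squared-loss gradient step is only a stand-in for whatever Update scheme is chosen.

```python
def ddol_round(weights, examples, answers, update, average, neighbors):
    """One synchronous DDOL round.

    weights[i]   : parameter vector of agent i (list of floats)
    examples[i]  : local example seen by agent i
    answers[i]   : local correct answer revealed to agent i
    neighbors[i] : indices agent i averages over (including i itself)
    """
    # Update: each agent uses only its own example and answer.
    updated = [update(w, x, l) for w, x, l in zip(weights, examples, answers)]
    # Communicate + Weighted Average: combine the neighbors' updated parameters.
    return [average([updated[j] for j in neighbors[i]])
            for i in range(len(weights))]


def arithmetic_mean(vectors):
    """Uniform arithmetic average, one possible Weighted Average scheme."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sgd_update(w, x, l, lr=0.1):
    """Illustrative Update step (assumption): gradient step on squared loss."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - l
    return [wi - lr * err * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    complete_graph = [[0, 1], [0, 1]]  # two agents, each averages over both
    w = ddol_round([[0.0], [0.0]], [[1.0], [1.0]], [1.0, 3.0],
                   sgd_update, arithmetic_mean, complete_graph)
    print(w)
```

On a complete graph every agent ends the round with the same parameters, which is the situation assumed in the analysis of the following sections.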

3. Distributed Weighted Majority 

We first propose two expert-advice-based online classification algorithms which can be regarded as distributed versions of the classic Weighted Majority algorithm (WMA) (Littlestone & Warmuth, 1989). For simplicity, we assume that in both algorithms, all the experts are shared by all the agents, and each agent is adjacent to all the other agents ($G$ is a complete graph).

Alg. 2 is named Distributed Weighted Majority by Imitation (DWM-I). In the communication step, each $A_i$ mimics the other agents' operations by penalizing each expert $p$ in the same way as every other agent does, and then performs a geometric averaging.

Algorithm 2 DWM-I: agent $A_i$
1: Initialize all weights $w_{i1}^0, \ldots, w_{iP}^0$ of $P$ shared experts for agent $A_i$ to 1.
2: for $t = 1, 2, \ldots$ do
3:   Given experts' predictions $y_{i1}^t, \ldots, y_{iP}^t$ over $x_i^t$, $A_i$ predicts
4:     $1$ if $\sum_{p: y_{ip}^t = 1} w_{ip}^{t-1} \ge \sum_{p: y_{ip}^t = -1} w_{ip}^{t-1}$; $-1$, otherwise.
5:   Environment reveals $l_i^t$ for $A_i$.
6:   $\tilde{w}_{ip}^t \leftarrow \alpha\, w_{ip}^{t-1}$ if $y_{ip}^t \ne l_i^t$ $(0 < \alpha < 1)$; $w_{ip}^{t-1}$, otherwise.
7:   $w_{ip}^t \leftarrow \left(\prod_{j=1}^{N} \tilde{w}_{jp}^t\right)^{1/N}$.
8: end for

The following result gives the upper bound of the average number of mistakes by each agent, assuming that each agent is receiving information from all the other agents.

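Theorem 3.1. For Algorithm 2 with $N$ agents and $P$ shared experts,

$$\max_i M_i \le \frac{1}{\log\frac{2}{1+\alpha}}\left(\frac{m^*}{N}\log\frac{1}{\alpha} + \log P\right),$$

where $M_i$ is the number of mistakes by agent $A_i$ and $m^*$ is the minimum number of mistakes by the best expert over all agents so far.

Proof. The proof essentially follows that of WMA. The best expert makes $m^*$ mistakes over all agents so far. So for any $A_i$, its weight of the best expert is $\alpha^{m^*/N}$. Upon each mistake made by any agent, the total weight $\sum_p w_{ip}$ of $A_i$ decreases by a factor of at least $\frac{1}{2}(1 - \alpha)$. So the total weight for $A_i$ is at most $P\left[1 - \frac{1}{2}(1 - \alpha)\right]^{M_i}$. Therefore for any $i$, $\alpha^{m^*/N} \le P\left(\frac{1+\alpha}{2}\right)^{M_i}$. It follows that

$$M_i \le \frac{1}{\log\frac{2}{1+\alpha}}\left(\frac{m^*}{N}\log\frac{1}{\alpha} + \log P\right). \qquad (1)$$

Taking $\alpha = 1/2$,

$$M_i \le 2.41\left(\frac{m^*}{N} + \log P\right). \qquad (2)$$

□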
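One round of Algorithm 2 can be sketched as follows. The sketch rests on assumptions that are not spelled out in the text: the graph is complete and all agents start from the same weights, so every agent holds an identical weight vector and a single shared list suffices; ties in the weighted vote are broken toward $+1$; the function name `dwm_i_round` is hypothetical. Under these assumptions, the geometric mean of the $N$ penalized copies of a weight $w$ equals $w\,\alpha^{k/N}$, where $k$ is the number of agents on which that expert erred.

```python
def dwm_i_round(weights, expert_preds, labels, alpha=0.5):
    """One DWM-I round on a complete graph (illustrative sketch).

    weights         : list of P expert weights shared by all agents
    expert_preds[j] : list of P predictions in {-1, +1} on agent j's example
    labels[j]       : correct answer in {-1, +1} revealed to agent j
    Returns (agent_predictions, new_weights).
    """
    n_agents = len(labels)

    # Each agent predicts by weighted majority vote over the shared experts.
    agent_preds = []
    for j in range(n_agents):
        vote = sum(w * y for w, y in zip(weights, expert_preds[j]))
        agent_preds.append(1 if vote >= 0 else -1)

    # Imitation + geometric averaging: an expert that erred on k of the N
    # agents has its weight multiplied by alpha ** (k / N).
    new_weights = []
    for p, w in enumerate(weights):
        k = sum(1 for j in range(n_agents) if expert_preds[j][p] != labels[j])
        new_weights.append(w * alpha ** (k / n_agents))
    return agent_preds, new_weights


if __name__ == "__main__":
    preds, w = dwm_i_round([1.0, 1.0], [[1, -1], [1, 1]], [-1, 1], alpha=0.25)
    print(preds, w)
```

After $m^*$ total mistakes of the best expert across all agents, its weight in this sketch is $\alpha^{m^*/N}$, which is the quantity used in the proof above.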
Comparing (2) with the result of WMA, $M \le 2.41(m^* + \log P)$: in the most optimistic case, if $m^*$ is of the same order as the number of mistakes made by the best expert in the single agent scheme, in other words, if the best expert makes no errors over all agents other than $A_i$, then the upper bound is decreased by a factor of $1/N$. In the most pessimistic case, if the best expert makes exactly the same number of mistakes over all $N$ agents, then the upper bound is the same as with a single agent. This happens when all agents are receiving the same inputs and answers from the environment, hence there is no new information being communicated among the network and no communications are needed. In reality, $m^*$ falls between these two extremes.
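The constants in bounds (1) and (2) can be checked numerically. The short sketch below evaluates the two coefficients of bound (1) for a given penalty factor; it assumes base-2 logarithms, which is what reproduces the constant 2.41 at $\alpha = 1/2$. The function name `wm_constants` is hypothetical.

```python
import math


def wm_constants(alpha):
    """Return (coefficient of m*/N, coefficient of log P) in bound (1)."""
    c = 1.0 / math.log2(2.0 / (1.0 + alpha))
    return c * math.log2(1.0 / alpha), c


if __name__ == "__main__":
    for a in (0.5, 0.9):
        mistakes_coef, log_p_coef = wm_constants(a)
        print(a, round(mistakes_coef, 2), round(log_p_coef, 2))
```

With $\alpha = 1/2$ both coefficients are about 2.41, and with $\alpha = 0.9$ the coefficient of $m^*/N$ is about 2.05.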

Theorem 3.1 is stated from an individual agent point of view. From the social point of view, the total number of mistakes made by all agents $\sum_{i=1}^{N} M_i$ is upper bounded by

$$\frac{1}{\log\frac{2}{1+\alpha}}\left(m^*\log\frac{1}{\alpha} + N\log P\right), \qquad (3)$$

which is not larger than that in a single agent scheme ($N\log P$ can be ignored in comparison with the first term $m^*$, which could be very large in practice). Imagine that $NT$ samples are processed in the single agent scheme, while in the $N$ agents scheme, each $A_i$ processes $T$ samples. In the most pessimistic case, upper bound (3) is the same for these two schemes. This is a very good property for parallel computing, since the proposed online DWM can achieve the same generalization capacity, while being $N$ times faster than a serial algorithm. This property is verified by the experiments in Section 5.

As in the Randomized Weighted Majority algorithm (RWM) (Littlestone & Warmuth, 1989), we can introduce some randomness to our choices of experts by giving each expert a probability of being chosen depending on the performance of this expert in the past. Specifically, in each round we choose an expert with probability $p_i = w_i / \sum_j w_j$. We can have a Distributed Randomized Weighted Majority and obtain a similar upper bound as that of RWM with a constant of $1/N$ as in Theorem 3.1.

The upper bound (1) can be further improved by an alternative
algorithm (Alg. 3), which differs from Alg. 2
only in the way that each agent utilizes information re- 
ceived from others. Instead of mimicking other agents' 
operations, an agent now updates its weights by arith- 
metically averaging together with all the weights it 
received from its neighbors. 

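Algorithm 3 DWM-A: agent $A_i$
Identical to Algorithm 2, except that the geometric averaging step is replaced by the arithmetic average
  $w_{ip}^t \leftarrow \frac{1}{N}\sum_{j=1}^{N} \tilde{w}_{jp}^t$.

Theorem 3.2. For Algorithm 3 with $N$ agents and $P$ shared experts,

$$\max_i M_i \le \frac{1}{\log\frac{2}{1+\alpha}}\left(\sum_t \frac{(1-\alpha)\,m_t^*}{N - (1-\alpha)\,m_t^*} + \log P\right),$$

where $M_i$ is the number of mistakes by agent $A_i$ so far and $m_t^*$ is the minimum number of mistakes by the best expert at round $t$ over all agents.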



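Proof. Denote the weight of expert $p$ for agent $A_i$ at round $t$ as $w_{ip}^t$. Since $G$ is complete, all agents hold the same weights after averaging. Indeed, for the best expert,

$$w_{ip}^t = \frac{1}{N}\sum_{j=1}^{N} \tilde{w}_{jp}^t = \frac{m_t^*\alpha + (N - m_t^*)}{N}\, w_{ip}^{t-1} = \left[1 - \frac{m_t^*(1-\alpha)}{N}\right] w_{ip}^{t-1}.$$

Using $1 - x \ge \exp(-x/(1-x))$, $\forall x \in (0, 1)$ and the fact that $0 \le m_t^* \le N$, we have for any agent $A_i$,

$$\prod_t \left[1 - \frac{m_t^*(1-\alpha)}{N}\right] \ge \exp\left(\sum_t \frac{-m_t^*(1-\alpha)}{N - m_t^*(1-\alpha)}\right). \qquad (4)$$

On the other hand, for any $A_i$,

$$\sum_{p=1}^{P} w_{ip}^t \le P\left[1 - \frac{1}{2}(1-\alpha)\right]^{M_i}. \qquad (5)$$

Since $w_{ip}^t \le \sum_{p=1}^{P} w_{ip}^t$, combining (4) and (5),

$$P\left[1 - \frac{1}{2}(1-\alpha)\right]^{M_i} \ge \exp\left(\sum_t \frac{-m_t^*(1-\alpha)}{N - m_t^*(1-\alpha)}\right).$$

It follows that for all $i = 1, \ldots, N$,

$$M_i \le \frac{1}{\log\frac{2}{1+\alpha}}\left(\sum_t \frac{(1-\alpha)\,m_t^*}{N - (1-\alpha)\,m_t^*} + \log P\right). \qquad (6)$$

□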



Now we are ready to compare the refined bound (6) with (1) using $m^* \ge \sum_t m_t^*$. Without considering the $\log P$ part of the bounds, which is much smaller than the $m^*$ part, it is easy to verify that if $1/2 \le \alpha < 1$, then

$$\frac{m^*\log\frac{1}{\alpha}}{N} \ge \sum_t \frac{(1-\alpha)\,m_t^*}{N - (1-\alpha)\,m_t^*} \qquad (7)$$

without any assumption on $m_t^*$; if $0 < \alpha < 1/2$, then the above inequality holds when

$$m_t^* \le N\left(\frac{1}{1-\alpha} - \frac{1}{\log(1/\alpha)}\right). \qquad (8)$$

The RHS of (8) is lower bounded by $0.81N$. Specifically, when $m_t^* = O(N/2)$ and by taking $\alpha = 1/2$, the difference in (7) is

$$\frac{m^*}{N} - \sum_t \frac{m_t^*}{2N - m_t^*},$$

which is at least $T/6$ after $T$ rounds when $m_t^* = N/2$ in every round.

Hence the error bound in Theorem 3.2 is much lower. Experimental evidence will be provided in Section 5.
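The comparison of bounds (1) and (6) can be checked numerically. The sketch below assumes base-2 logarithms (consistent with the constant 2.41 in (2)) and uses hypothetical helper names: `rhs_factor` evaluates the factor multiplying $N$ on the right-hand side of (8), and `per_round_gap` evaluates the per-round difference between the two sides of (7) when the best expert errs on `m` of `n` agents.

```python
import math


def rhs_factor(alpha):
    """Factor multiplying N on the right-hand side of (8)."""
    return 1.0 / (1.0 - alpha) - 1.0 / math.log2(1.0 / alpha)


def per_round_gap(alpha, m, n):
    """Per-round term of the LHS of (7) minus the matching term of its RHS."""
    lhs = m * math.log2(1.0 / alpha) / n
    rhs = (1.0 - alpha) * m / (n - (1.0 - alpha) * m)
    return lhs - rhs


# Smallest value of the factor in (8) over a grid of alpha in (0, 1/2).
min_factor = min(rhs_factor(k / 10000.0) for k in range(1, 5000))

if __name__ == "__main__":
    print(round(min_factor, 3))
    print(per_round_gap(0.5, 5, 10))  # m = N/2 with alpha = 1/2
```

On this grid the minimum of the factor is about 0.81, and for $\alpha \ge 1/2$ the per-round gap is non-negative for every $0 \le m \le N$.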

4. Distributed Online Mirror Descent 

In this section we extend the idea of distributed on- 
line learning to Online Convex Optimization (OCO) 
problems. OCO is an online variant of convex optimization,
which is ubiquitous in many machine learning
problems such as support vector machines, logistic
regression and sparse signal reconstruction tasks. Each 
of these learning algorithms has a convex loss function 
to be minimized. 



One can consider OCO as a repeated game between an algorithm $A$ and the environment. At each round $t$, $A$ chooses a strategy $w_t \in W$ and the environment reveals a convex function $f_t$. Here we assume that all convex functions share the same feasible set $W$. The goal of $A$ is to minimize the difference between the cumulative loss $\sum_t f_t(w_t)$ and that of the best strategy $w^*$ it can play in hindsight. This difference is commonly known as external regret, defined as below.

Definition 4.1. The regret of convex functions $f = \{f_t\}$ for $t = 1, 2, \ldots, T$ is defined as $R(T) = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in W} \sum_{t=1}^{T} f_t(w)$.

In the distributed setting, this game is played by every agent $A_i$, $i = 1, \ldots, N$. The goal of $A_i$ is to minimize its own regret $R_i(T)$, named the individual regret. We call the sum of individual regrets $R(T) = \sum_{i=1}^{N} R_i(T)$ the social regret.

We will present the online mirror descent (OMD) framework, which generalizes many OCO algorithms such as online subgradient descent (Zinkevich, 2003), Winnow (Littlestone, 1988), online exponentiated gradient (Kivinen & Warmuth, 1997), and online Newton's method (Hazan et al., 2006).

We first introduce some notation used in this section. A distance generating function $\omega(u)$ is a continuously differentiable function that is $\alpha$-strongly convex w.r.t. some norm $\|\cdot\|$ associated with an inner product $\langle\cdot,\cdot\rangle$. Using the Bregman divergence $\psi(u, v) = \omega(u) - \omega(v) - \langle\nabla\omega(v), u - v\rangle$ as a proximity function, the update rule of OMD can be expressed as $w_{t+1} \leftarrow \arg\min_{z \in W} \eta_t\langle g_t, z - w_t\rangle + \psi(z, w_t)$, where $g_t$ is a subgradient of $f_t$ at $w_t$ and $\eta_t$ is a learning rate which plays an important role in the regret bound. Denote the dual norm of $\|\cdot\|$ as $\|\cdot\|_*$.

Suppose agent $A_i$ has $N_i - 1$ neighbors. We propose the Distributed Online Mirror Descent algorithm in Alg. 4. In this algorithm, the update rule has explicit expressions for some special proximity functions $\psi(\cdot,\cdot)$. Next we derive distributed update rules for two well-known OMD examples: Online Gradient Descent (OGD) and Online Exponentiated Gradient (OEG).

Algorithm 4 DOMD: agent $A_i$
Initialize $w_i^1 \in W$
for $t = 1, 2, \ldots$ do
    Local prediction using $w_i^t$.
    $w_i^{t+1} \leftarrow \arg\min_{z \in W} \sum_{j=1}^{N_i}\left[\eta_t\langle g_j^t, z - w_j^t\rangle + \psi(z, w_j^t)\right]$
end for

Distributed OGD

Taking $\psi(u, v) = \frac{1}{2}\|u - v\|_2^2$ (i.e. the proximity is measured by squared Euclidean distance), an agent $A_i$ needs to solve the minimization $\min_z \sum_{j=1}^{N_i}\left[\eta_t\langle g_j^t, z - w_j^t\rangle + \frac{1}{2}\|z - w_j^t\|_2^2\right]$, which leads to a simple DOGD updating rule

$$w_i^{t+1} \leftarrow \frac{1}{N_i}\sum_{j=1}^{N_i}\left(w_j^t - \eta_t g_j^t\right). \qquad (9)$$
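The DOGD rule (9) amounts to each agent taking the arithmetic mean of its neighbors' locally updated parameters. A minimal sketch for the unconstrained case follows; the function name `dogd_update` is hypothetical, and the neighbor lists are assumed to include the agent itself.

```python
def dogd_update(neighbor_weights, neighbor_grads, eta):
    """DOGD step: average of (w_j - eta * g_j) over the N_i neighbors.

    neighbor_weights : list of N_i parameter vectors w_j^t
    neighbor_grads   : list of N_i subgradients g_j^t
    eta              : learning rate eta_t
    """
    n = len(neighbor_weights)
    dim = len(neighbor_weights[0])
    return [
        sum(w[d] - eta * g[d] for w, g in zip(neighbor_weights, neighbor_grads)) / n
        for d in range(dim)
    ]


if __name__ == "__main__":
    print(dogd_update([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]], 0.5))
```

The rule follows from setting the gradient of the quadratic objective to zero: $\sum_j (\eta_t g_j^t + z - w_j^t) = 0$.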



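Distributed OEG

Taking the unnormalized relative entropy as the proximity function, $\psi(u, v) = \sum_{d=1}^{D} u_d \ln u_d - \sum_{d=1}^{D} v_d \ln v_d - (\ln v + \mathbf{1})^T(u - v)$, we can solve the minimization $\min_z \sum_{j=1}^{N_i}\left[\eta_t\langle g_j^t, z - w_j^t\rangle + \sum_{d=1}^{D} z_d \ln z_d - \sum_{d=1}^{D} (w_j^t)_d \ln (w_j^t)_d - (\ln w_j^t + \mathbf{1})^T(z - w_j^t)\right]$, and obtain the update rule for DOEG:

$$(w_i^{t+1})_d \leftarrow \prod_{j=1}^{N_i}\left[(w_j^t)_d \exp\left(-\eta_t (g_j^t)_d\right)\right]^{1/N_i}. \qquad (10)$$

If the feasible set $W$ is a simplex ball $\|w\|_1 \le S$ instead of $\mathbb{R}^D$, one only needs to do an extra normalization:

$$\forall d = 1, \ldots, D: \quad (w_i^{t+1})_d \leftarrow \frac{S\,(w_i^{t+1})_d}{\|w_i^{t+1}\|_1} \quad \text{if } \|w_i^{t+1}\|_1 > S.$$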
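The DOEG rule (10) is a coordinate-wise geometric mean of the neighbors' exponentiated-gradient updates. The sketch below computes it in the log domain and applies a rescaling to the $L_1$ ball of radius $S$ when the norm exceeds $S$. The function name `doeg_update` is hypothetical, weights are assumed strictly positive, and the rescaling form is an assumption about the normalization step rather than a confirmed detail.

```python
import math


def doeg_update(neighbor_weights, neighbor_grads, eta, radius=None):
    """DOEG step: geometric mean of w_j * exp(-eta * g_j) over the neighbors.

    neighbor_weights : list of N_i strictly positive parameter vectors
    neighbor_grads   : list of N_i subgradients
    eta              : learning rate eta_t
    radius           : optional L1 radius S; rescale if the L1 norm exceeds it
    """
    n = len(neighbor_weights)
    dim = len(neighbor_weights[0])
    new_w = [
        math.exp(
            sum(math.log(w[d]) - eta * g[d]
                for w, g in zip(neighbor_weights, neighbor_grads)) / n
        )
        for d in range(dim)
    ]
    if radius is not None:
        norm1 = sum(new_w)  # entries are positive, so this is the L1 norm
        if norm1 > radius:
            new_w = [radius * v / norm1 for v in new_w]
    return new_w


if __name__ == "__main__":
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(doeg_update([[1.0, 4.0], [4.0, 1.0]], zero, 0.1))
    print(doeg_update([[1.0, 4.0], [4.0, 1.0]], zero, 0.1, radius=2.0))
```

Working with logarithms avoids forming the product of many small weights directly, which is a common numerical precaution for multiplicative updates.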



Updating rules (9) and (10) share the same spirit as 
stated in the meta-algorithm 1: each agent updates 



Data-Distributed Weighted Majority and Online Mirror Descent 



its parameters individually, then averages them with its neighbors' parameters, either arithmetically or geometrically. The following results show how this simple averaging scheme works, in terms of the average individual regret $\frac{1}{N}\sum_i R_i(T)$. As in Theorem 3.1 and Theorem 3.2, for simplicity, we assume that the graph $G$ is complete, i.e. each agent has $N - 1$ neighbors.

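Lemma 4.2. (Nemirovsky & Yudin, 1983) Let $P_w(u) = \arg\min_{z \in W} \langle u, z - w\rangle + \psi(z, w)$. For any $v, w \in W$ and $u \in \mathbb{R}^D$ one has

$$\langle u, w - v\rangle \le \psi(w, v) - \psi(P_w(u), v) + \frac{\|u\|_*^2}{2\alpha}. \qquad (11)$$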

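Theorem 4.3. If $N$ agents in Algorithm 4 are connected via a complete graph, $f_t$ are convex, distances between two parameter vectors are upper bounded, $\sup_{i,j,t} \psi(w_i^t, w_j^t) = D^2$, and $\eta_t = 1/\sqrt{t}$, then the average individual regret satisfies

$$\frac{1}{N}\sum_{i=1}^{N} R_i(T) \le D^2\sqrt{T} + \frac{1}{2\alpha}\sum_{t=1}^{T} \eta_t \left\|\frac{1}{N}\sum_{j=1}^{N} g_j^t\right\|_*^2. \qquad (12)$$

Proof. Since $G$ is complete, at a fixed $t$, $w_i^t$ is the same for any $i$; denote it by $w^t$. Hence

$$w_i^{t+1} = \arg\min_{z \in W} \sum_{j=1}^{N}\left[\eta_t\langle g_j^t, z - w_j^t\rangle + \psi(z, w_j^t)\right] = \arg\min_{z \in W} \left\langle \frac{\eta_t}{N}\sum_{j=1}^{N} g_j^t,\ z - w^t\right\rangle + \psi(z, w^t) = P_{w^t}\left(\frac{\eta_t}{N}\sum_{j=1}^{N} g_j^t\right).$$

Let $u = \frac{\eta_t}{N}\sum_{j=1}^{N} g_j^t$, $v = w^*$, $w = w^t$ in (11); we have

$$\left\langle \frac{1}{N}\sum_{j=1}^{N} g_j^t,\ w^t - w^*\right\rangle \le \frac{1}{\eta_t}\left[\psi(w^t, w^*) - \psi(w^{t+1}, w^*)\right] + \frac{\eta_t}{2\alpha}\left\|\frac{1}{N}\sum_{j=1}^{N} g_j^t\right\|_*^2.$$

Using the convexity of $f_t$ and summing the above inequality over $t$ we have

$$\frac{1}{N}\sum_{i=1}^{N} R_i(T) \le \sum_{t=1}^{T}\left\langle \frac{1}{N}\sum_{j=1}^{N} g_j^t,\ w^t - w^*\right\rangle \le \frac{1}{\eta_1}\psi(w^1, w^*) - \frac{1}{\eta_T}\psi(w^{T+1}, w^*) + \sum_{t=2}^{T}\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right)\psi(w^t, w^*) + \sum_{t=1}^{T}\frac{\eta_t}{2\alpha}\left\|\frac{1}{N}\sum_{j=1}^{N} g_j^t\right\|_*^2.$$

Setting $\eta_t = 1/\sqrt{t}$ and using the assumption on the upper bound of $\psi(\cdot,\cdot)$ we reach the result. □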



To appreciate the above theorem, we further assume that the subgradient is upper bounded: $\sup_{w \in W,\ t = 1, 2, \ldots} \|g^t(w)\|_* = G$. In the most optimistic case, at a given round $t$, if the subgradients $g_j^t$, $j = 1, \ldots, N$ are mutually orthogonal, then the second term of the upper bound (12) can be bounded by $\frac{1}{\alpha N}G^2\sqrt{T}$, which is $1/N$ of the value obtained with a single agent. In the most pessimistic case, if all the subgradients $g_j^t$, $j = 1, \ldots, N$ are exactly the same, then the second term is bounded by $\frac{1}{\alpha}G^2\sqrt{T}$, which is the same as in a single agent scheme.

According to the regret bound (12), the social regret satisfies

$$\sum_{i=1}^{N} R_i(T) \le ND^2\sqrt{T} + \frac{N}{2\alpha}\sum_{t=1}^{T} \eta_t \left\|\frac{1}{N}\sum_{j=1}^{N} g_j^t\right\|_*^2.$$

In the most optimistic case, the bound is $ND^2\sqrt{T} + \frac{G^2}{2\alpha}\sum_{t=1}^{T}\eta_t \le \left(ND^2 + \frac{G^2}{\alpha}\right)\sqrt{T}$. In the most pessimistic case, the bound becomes $\left(ND^2 + \frac{NG^2}{\alpha}\right)\sqrt{T}$.

Imagine that $NT$ samples need to be processed. In the single agent scheme, they are accessed by only 1 agent, while in the $N$ agents scheme, these $NT$ samples are evenly distributed with each $A_i$ processing $T$ samples. In the most optimistic case, the bound for the $N$ agent scheme is $\left(ND^2 + \frac{G^2}{\alpha}\right)\sqrt{T}$, while in the most pessimistic case, it is $\left(ND^2 + \frac{NG^2}{\alpha}\right)\sqrt{T}$. In comparison, the bound for the single agent scheme is $\left(D^2\sqrt{N} + \frac{G^2\sqrt{N}}{\alpha}\right)\sqrt{T}$. We cannot draw an immediate conclusion of which one is better, since it depends on the correlations of examples, as well as $D$ and $G$. But it is clear that the $N$ agent scheme is at most $\sqrt{N}$ times larger in its social regret bound, while being $N$ times faster.

5. Experimental Study 

In this section, several sets of online classification experiments will be used to evaluate the theories and the proposed distributed online learning algorithms. Three real-world binary-class datasets from various application domains are adopted. Table 1 summarizes these datasets and the parameters used in Section 5.2.



Table 1. Dataset facts and parameters.

Name        #         D    Non-0   Balance      C      S
svmguide1   3,089     4    100%    1: 64.7%
cod-rna     59,535    8    100%    -1: 66.7%    1e-2   1e4
covtype     522,910   54   22%     -1: 51.2%    1e4    1e4


Datasets: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

To simulate the behavior of multiple agents, we use Pthreads (POSIX Threads) for multi-threaded programming, where each thread is an agent, and the threads communicate with each other via shared memory. Barriers are used for synchronization. All experiments are carried out on a workstation with a 4-core 2.7GHz Intel Core 2 Quad CPU.
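The thread-per-agent simulation with barrier synchronization can be illustrated with a small analogue. The sketch below uses Python's `threading.Barrier` rather than Pthreads, so it only mirrors the structure described above; the function name `run_agents` and the use of a scalar per agent are illustrative assumptions.

```python
import threading


def run_agents(local_values):
    """Each thread publishes its local value, waits at a barrier, then averages.

    The barrier guarantees that every agent has written to shared memory
    before any agent reads the shared values.
    """
    n = len(local_values)
    shared = [0.0] * n    # shared memory: one slot per agent
    results = [0.0] * n
    barrier = threading.Barrier(n)

    def agent(i):
        shared[i] = local_values[i]    # Update: publish the local result
        barrier.wait()                 # synchronize before communicating
        results[i] = sum(shared) / n   # Weighted Average over all agents

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_agents([1.0, 2.0, 3.0, 6.0]))
```

In a multi-round loop a second barrier would be needed after the averaging step, so that no agent overwrites its shared slot while others are still reading it.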

5.1. Distributed Weighted Majority 

To evaluate the proposed DWM algorithms, the simplest decision stumps are chosen as experts, and all the experts are trained off-line. We randomly choose $P \le D$ dimensions. Within each dimension, 200 probes are evenly placed between the min and max values of this dimension. The probe with the minimum training error over the whole dataset is selected as the decision threshold. In all the following weighted majority experiments, we choose the penalty factor $\alpha = 0.9$.
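The off-line training of one decision-stump expert can be sketched as a scan over evenly spaced probes in a single dimension. The function name `train_stump` is hypothetical, and trying both polarities of the stump is an assumption not stated in the text.

```python
def train_stump(values, labels, n_probes=200):
    """Pick the threshold with minimum training error along one dimension.

    values : feature values of all training samples in the chosen dimension
    labels : correct answers in {-1, +1}
    Returns (errors, threshold, sign); the stump predicts `sign` when
    value > threshold and `-sign` otherwise.
    """
    lo, hi = min(values), max(values)
    best = None
    for k in range(n_probes):
        thr = lo + (hi - lo) * k / (n_probes - 1)  # evenly placed probes
        for sign in (1, -1):                       # both polarities (assumption)
            errors = sum(
                1 for v, l in zip(values, labels)
                if (sign if v > thr else -sign) != l
            )
            if best is None or errors < best[0]:
                best = (errors, thr, sign)
    return best


if __name__ == "__main__":
    print(train_stump([0.0, 1.0, 2.0, 3.0], [-1, -1, 1, 1]))
```

Each scan costs one pass over the data per probe, which is acceptable here because the experts are trained once before the online phase starts.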

The first set of experiments reports the behaviors of DWM-I and DWM-A from the individual agent point of view. Each agent shares the same $P = 4$ experts, and communicates with all the others. Fig. 1 and 2 depict the cumulative number of mispredictions made by each thread as a function of the number of samples accessed by a single agent, where 1, 2, 3 and 4 agents are compared. Each plot in a subfigure represents an agent. It is clear that an agent $A_i$ makes fewer mistakes $M_i$ as it receives more information from its neighbors. With 4 agents, $M_i$ is reduced by half compared with the single agent case. This provides some evidence for the $1/N$ error reduction as stated in Theorem 3.1.




Figure 1. DWM-I: cumulative mistakes on svmguide1. (Panels for 1, 2, 3 and 4 agents; x-axis: # of samples accessed.)

As discussed in Section 3, from the social point of view, with the same number of samples accessed, the bound (3) of the total number of mistakes made by all agents ($\sum_i M_i$) is almost as small as that in a single agent case. The comparisons for both DWM-I and DWM-A are illustrated in Fig. 3. This result is not surprising, since no more information is provided for multiple agents, and one should not hope that $\sum_i M_i$ is much lower than $M$. But on the other hand, the DWM algorithms achieve the same level of mistakes, while they are $N$ times faster. It can also be observed from Fig. 1, 2 and 3 that DWM-A makes slightly fewer mistakes than DWM-I.

Figure 2. DWM-A: cumulative mistakes on svmguide1. (Panels for 1, 2, 3 and 4 agents; x-axis: # of samples accessed.)



Figure 3. svmguide1: total # of mistakes over all agents. (Panels: DWM-I and DWM-A; x-axis: total # of samples accessed.)

To verify the tightness of bound (1) and the refined bound (6), we compare the number of mistakes $m^*$ made by the best expert over all agents with that of a single agent, $M_i$. Fig. 4 shows that with $N = 2$ or 5 agents, $m^*$ is around 2 or 5 times larger than $M_i$, which means $M_i \approx m^*/N$. However, choosing $\alpha = 0.9$ in bound (1) leads to $M_i \le 2.05\,m^*/N$. This shows that the bound in Theorem 3.2 is indeed tighter than Theorem 3.1.

5.2. Distributed Online Mirror Descent 

In this section, several online classification experiments will be carried out using the proposed DOGD and DOEG algorithms. For DOGD, we choose the L2-regularized instance hinge loss function as our convex objective function:

$$f_t(w) = C\max\{0,\ 1 - l_t w^T x_t\} + \|w\|_2^2/2.$$
Figure 4. svmguide1: # of mistakes: best expert vs. multi-agents.

Figure 5. DOGD; cod-rna; mispredictions

For DOEG, we take $f_t(w) = \max\{0,\ 1 - l_t w^T x_t\}$ and $W = \{w : \|w\|_1 \le S\}$. Since the update rule (10) cannot change the signs of $w_d$, we use a trick similar to EG$^{\pm}$ proposed in (Kivinen & Warmuth, 1997), i.e. letting $w = w^+ - w^-$, where $w^+, w^- > 0$. Since we will not compare the generalization capacities between these two algorithms, in all the following experiments, the parameters $C$ and $S$ are chosen according to Table 1 without any further tuning. The subgradient of the non-smooth hinge loss is taken as $g_t = -l_t x_t$ if $1 - l_t w^T x_t > 0$ and $0$ otherwise.
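The hinge-loss subgradient and the sign-splitting trick can be sketched as follows. The function names are hypothetical. The only fact used for the split is the chain rule: with $w = w^+ - w^-$, the subgradient with respect to $w^+$ is $g_t$ and the subgradient with respect to $w^-$ is $-g_t$.

```python
def hinge_subgradient(w, x, label):
    """Subgradient of max{0, 1 - label * <w, x>} with respect to w."""
    margin = label * sum(wi * xi for wi, xi in zip(w, x))
    if 1.0 - margin > 0.0:
        return [-label * xi for xi in x]
    return [0.0] * len(x)


def split_subgradients(w_pos, w_neg, x, label):
    """Subgradients for the positive and negative parts of w = w_pos - w_neg."""
    w = [p - n for p, n in zip(w_pos, w_neg)]
    g = hinge_subgradient(w, x, label)
    return g, [-gi for gi in g]


if __name__ == "__main__":
    print(hinge_subgradient([0.0, 0.0], [1.0, 2.0], 1))
    print(split_subgradients([1.0, 1.0], [1.0, 1.0], [1.0, 2.0], 1))
```

Both parts stay strictly positive under the multiplicative rule (10), while their difference can take either sign.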

We first illustrate the generalization capacities of DOGD and DOEG. Since we do not know $\inf_{w \in W}\sum_{t} f_t(w)$, it is not easy to calculate the individual regret or social regret. Hence we only compare the number of mispredictions and the average accumulated objective function values as functions of the number of samples accessed by a single agent. The results are shown in Fig. 5 to 8.

It is clear that for both DOGD and DOEG, the number of mispredictions decreases when more agents communicate with each other. The average objective values also decrease with the increasing number of agents $N$. However, as shown in Fig. 8, when $N = 32$, the averaged objective value is larger than for $N = 16$. This might be due to the insufficient number of samples of the dataset cod-rna. This conjecture is experimentally verified in Fig. 9, where the size of covtype is 522,910.

As discussed at the end of Section 4, the social regret bound of $N$ agents is at most $\sqrt{N}$ times larger than that of a single agent scheme. The next set of experiments will be used to verify this claim. Fig. 10 depicts the result. We can see that the total loss over all agents for $N = 8, 16, 32$ is even lower than using a single agent. $N = 64$ is slightly higher, but the difference is still much lower than the theoretical $\sqrt{64}$. This suggests that there might exist a bound tighter than (12).

Figure 6. DOGD; cod-rna; average objective values. (One panel per number of agents; x-axis: # of samples accessed.)

Figure 7. DOEG; cod-rna; mispredictions
Figure 8. DOEG; cod-rna; average objective values. (Curves for 1, 2, 4, 8, 16 and 32 agents; x-axis: # of samples accessed.)

Figure 9. DOEG; covtype; average objective values. (x-axis: # of samples accessed.)

Figure 10. DOEG; covtype; total objective values. (x-axis: total # of samples accessed.)

6. Conclusions and Future Work 

We proposed a generic data-distributed online learn- 
ing meta-algorithm. As concrete examples, two sets 
of distributed algorithms were derived. One is for dis- 
tributed weighted majority, and the other is for dis- 
tributed online convex optimization. Their effective- 
ness is supported by both analysis and experiments. 

The analysis shows that with $N$ agents, DWM can have an upper error bound that is a factor of $1/N$ of that of a single agent. From the social point of view, the bound of the total number of errors made by all agents is the same as using 1 agent, while processing the same amount of examples. This indicates that DWM attains the same level of generalization error as WM, but is $N$ times faster.

The average individual regret for DOMD algorithms is also much lower than OMD, although the reduction is not by a factor of $1/N$ as in DWM. In the worst case, the bound of social regret is at most $\sqrt{N}$ times higher than using a single agent.

In follow-on work, two assumptions made in this pa- 
per will be removed to make the proposed algorithms 
more robust in practical applications. Firstly, as dis- 
cussed in (Duchi et al., 2010), the connected graph G 
does not need to be complete. We are working on 
distributed active learning and active teaching, which 
might lead to a data-dependent communication topol- 
ogy. Secondly, the learning process should be fully 
asynchronous. This brings up the problem of 'delays'
in label feedback (Mesterharm, 2005; Langford et al.,
2009). Moreover, for OCO, with more structural
information on $f_t$ rather than the black-box model, we
might be able to find better distributed algorithms and 
achieve tighter bounds. 